My initial reason for conducting this analysis was to learn more about using ggplot2 to generate choropleth maps and, more generally, to improve my visualization skills with ggplot2. It is a great tool, but it takes considerable knowledge to use easily and effectively.
As I started working through the analysis, I realized that the biggest challenge would be the data munging required. In many R-library learning exercises the data is pristine and well suited to the example presented. In the real world, however, even data from a solid, reputable source may be inconsistent, mislabeled, or ill-formatted, and documentation about the metadata is often sparse or missing.
For this analysis, I collected two data sets provided by the MIT Election Data and Science Lab. One contains the US presidential election results by state and the other contains the results by county.
Cite: MIT Election Data and Science Lab, 2017, “U.S. President 1976–2020”, https://doi.org/10.7910/DVN/42MVDX, Harvard Dataverse, V6, UNF:6:4KoNz9KgTkXy0ZBxJ9ZkOw== [fileUNF]
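The map_data() function in ggplot2 provides the geospatial tables we need for the analysis (it reads them from the maps package). We will start with the “state” table and later move on to the “county” table.
library(tidyverse) # ggplot2, dplyr, tidyr, readr; map_data() also needs the maps package installed
usa_tbl <- map_data("state") %>% as_tibble()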
With the US map table loaded, we can use the ggplot2 geom_map function to display the map. The plotted map shows the states in the “state” table (the lower 48 plus the District of Columbia), outlined in grey.
We also add the coord_map function, which requires the mapproj package, so that the map is drawn with a proper projection, here an orthographic one.
usa_tbl %>% ggplot(aes(long, lat, map_id = region)) +
geom_map(
map = usa_tbl,
color = "grey80", fill = "grey30", size = 0.3
) +
coord_map("ortho", orientation = c(19,-98,0))
Now we will do the same in ggplot2 with county-level geospatial data for the USA.
The same map_data() function supplies the “county” table with the county-level detail.
Since the county data does not have a FIPS identifier, I will have to create a state-county identifier to make sure the county map data has a unique key. Just using a county name as identifier will not work since many counties have the same name across different states. This was the beginning of the data munging effort.
# make a state-county identifier
county_usa_tbl <- map_data("county") %>% as_tibble() %>%
unite(countyID, region, subregion, sep = "_", remove = FALSE)
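To see why the county name alone cannot serve as a key, here is a minimal sketch on a hand-made table (the rows are illustrative and typed in by hand, not read from the map data). It counts how many states share each county name and then builds the same state_county key with unite().

```r
library(dplyr)
library(tidyr)

# Toy rows (illustrative): the same county name appears in several states
toy_counties <- tibble::tibble(
  region    = c("georgia", "iowa", "utah", "georgia"),
  subregion = c("washington", "washington", "washington", "fulton")
)

# County names used by more than one state cannot act as a unique key
dup_names <- toy_counties %>%
  distinct(region, subregion) %>%
  count(subregion, name = "n_states") %>%
  filter(n_states > 1)

# The combined state_county key is unique for every row
toy_keyed <- toy_counties %>%
  unite(countyID, region, subregion, sep = "_", remove = FALSE)

dup_names
toy_keyed$countyID
```

The same duplicate check could be run on county_usa_tbl itself before and after the key is added.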
The map again uses the ggplot2 geom_map function and shows the counties outlined in grey. Here too I added the coord_map function so that the map is projected properly.
county_usa_tbl %>%
ggplot(aes(long, lat, map_id = region)) +
geom_map(
map = county_usa_tbl,
color = "grey80", fill = "grey30", size = 0.3
) +
coord_map("ortho", orientation = c(19,-98,0))
Reshaping the presidential data
# thin data to the 2020 presidential results
election_results <- read_csv("voting_data/1976-2020-president.csv", col_names = TRUE) %>%
filter(year == 2020) %>%
# share of the state's total vote won by each candidate, in percent
mutate(pct_votes = round(candidatevotes / totalvotes * 100, digits = 3)) %>%
rename(region = state) %>%
# lower-case the state names so they match the region column of the map table
mutate(region = tolower(region))
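# narrow to just the Republican voting percentage by state
republican_voting <- election_results %>%
filter(party_simplified == "REPUBLICAN") %>%
# select by name rather than by position so the step survives column reordering
select(region, party_simplified, pct_votes)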
usa_republican_tbl <- usa_tbl %>%
left_join(republican_voting, by = "region")
usa_republican_tbl
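Because left_join() silently leaves NA where a region has no match, it can help to compare the keys of the two tables before plotting. The following is a minimal sketch on toy tables whose values are made up. With the real data, usa_tbl and republican_voting would take the place of the hypothetical toy_map and toy_votes.

```r
library(dplyr)

# Toy stand-ins for the map table and the voting table (values are made up)
toy_map <- tibble::tibble(
  region = c("ohio", "ohio", "texas", "district of columbia")
)
toy_votes <- tibble::tibble(
  region    = c("ohio", "texas", "alaska"),
  pct_votes = c(55, 52, 53)
)

# Map regions with no voting row: these would be drawn with an NA fill
unmatched_map <- toy_map %>%
  distinct(region) %>%
  anti_join(toy_votes, by = "region")

# Voting rows with no polygon: these never appear on the map
unmatched_votes <- toy_votes %>%
  anti_join(toy_map, by = "region")

unmatched_map
unmatched_votes
```

Both results being empty would suggest that every polygon gets a fill value and every voting row is drawn.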
# draw the state-level choropleth, filled by the Republican vote share
usa_republican_tbl %>%
ggplot(aes(long, lat, group = region)) +
geom_map(
aes(map_id = region),
map = usa_tbl,
color = "gray80", fill = "gray30", size = 0.3
) +
coord_map("ortho", orientation = c(38, -98, 0)) +
geom_polygon(aes(group = group, fill = pct_votes), color = "black") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 50) +
theme_minimal() +
labs(title = "Republican State Voting Percentages in 2020",
x = "", y = "", fill = ""
) +
theme(
plot.title = element_text(size = 20, face = "bold", color = "red"),
legend.position = "bottom"
)
This map shows the Republican share of the vote in each state. States close to white had very close election results.
Next, the data is filtered to the states where the major-party vote shares were very close.
# keep the Republican and Democrat rows where the vote share fell between 47% and 52%
major_party_voting <- election_results %>%
select(2:4, 11:12, 15:16) %>%
filter(party_simplified == "REPUBLICAN" | party_simplified == "DEMOCRAT") %>%
filter(pct_votes > 47 & pct_votes < 52)
major_party_voting
# Grouped
ggplot(major_party_voting, aes(fill=party_simplified, y=pct_votes, x=state_po)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Major Party Presidential Voting",
subtitle="Presidential Data",
caption="Source: Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
coord_cartesian(ylim = c(47, 51)) +
theme(panel.grid = element_line(color = "grey",
size = 0.75,
linetype = 2)) +
geom_text(aes(label = pct_votes), vjust = -0.2, size = 3,
position = position_dodge(0.9)) +
labs(x = "State", y = "Percent Votes")
The above chart illustrates the close voting margins for several states.
It essentially shows the white states on the choropleth map.
County level data is pulled from the MIT site.
Cite: MIT Election Data and Science Lab, 2018, “County Presidential Election Returns 2000-2020”, https://doi.org/10.7910/DVN/VOQCHQ, Harvard Dataverse, V9, UNF:6:qSwUYo7FKxI6vd/3Xev2Ng== [fileUNF]
Create a voting percentage field in the data set
Generate a table that illustrates the county-level Republican voting percentages
# change names to lower case values along with names
# create a unique county key
county_election_results <- read_csv("voting_data/countypres_2000-2020.csv", col_names = TRUE) %>%
filter(year == "2020")%>%
mutate(pct_votes = as.numeric(round(((candidatevotes/totalvotes)*100), digits = 3 ))) %>%
mutate(county_name = tolower(county_name)) %>%
rename(subregion = county_name)%>%
rename(region = state) %>%
mutate(region = tolower(region)) %>%
# need to create unique county name. Concatenated state and county
unite(countyID, region, subregion, sep = "_", remove = FALSE)
# narrow to just voting percentage by Republicans by county
# filter the data for the totals of votes for the republican candidate
# mode data segments by total, early, mail, election day. Filter to collect totals
county_republican_voting <- county_election_results %>%
select(2, 6, 9:11, 13, 14) %>%
filter(party == "REPUBLICAN")
Join the Republican county voting table to the county map table for choropleth generation
county_usa_republican_tbl <- county_usa_tbl %>%
left_join(county_republican_voting , by = "countyID")
county_usa_republican_tbl
There is an awful lot of dark blue (values near 0%) across several states.
# draw the county-level choropleth, filled by the Republican vote share
county_usa_republican_tbl %>%
ggplot(aes(long, lat, group = region)) +
geom_map(
aes(map_id = region),
map = county_usa_tbl,
color = "gray80", fill = "gray30", size = 0.3
) +
coord_map("ortho", orientation = c(38, -98, 0)) +
geom_polygon(aes(group = group, fill = pct_votes), color = "black") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 50) +
theme_minimal() +
labs(title = "County Republican Voting Percentages in 2020",
x = "", y = "", fill = ""
) +
theme(
plot.title = element_text(size = 15, face = "bold", color = "red"),
legend.position = "bottom"
)
Discovery: In this data set, not every state reported county results with a “TOTAL” mode. Approximately 10 states reported county votes only as a breakdown by mode. Their counties are shown in blue and are not represented correctly in the choropleth. Utah is the exception: it reported both a “TOTAL” row and the individual modes. We therefore need to split the data into two tables, counties_with_totals and counties_without_totals.
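One way to find the affected states is to summarise the mode column per state. This is a minimal sketch on made-up rows that only carry the two columns the check needs. The "EARLY" mode name is an assumption for illustration, and with the real data county_election_results would replace the hypothetical toy_results.

```r
library(dplyr)

# Toy rows (made up): state_po and mode are the only columns the check needs
toy_results <- tibble::tibble(
  state_po = c("OH", "GA", "GA", "UT", "UT", "IA", "IA"),
  mode     = c("TOTAL", "ABSENTEE", "ELECTION DAY",
               "TOTAL", "EARLY", "ABSENTEE", "ELECTION DAY")
)

# One row per state: does it report a TOTAL row, and does it report other modes?
mode_summary <- toy_results %>%
  group_by(state_po) %>%
  summarise(
    has_total = any(mode == "TOTAL"),
    has_modes = any(mode != "TOTAL"),
    .groups = "drop"
  )

# States that need their totals rebuilt from the modes
states_without_total <- mode_summary %>% filter(!has_total) %>% pull(state_po)

# States that report both, and would be double counted if nothing is filtered
states_with_both <- mode_summary %>% filter(has_total, has_modes) %>% pull(state_po)
```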
Note: mode variable
We may have to reshape the data tables with tidyr.
We may also need to segment the data and recalculate the total votes.
One table holds the results that have a “TOTAL” mode and the other holds those that do not.
# Break down into the different ways the counties reported votes
counties_with_totals <- county_election_results %>%
filter(mode == "TOTAL")
# use tidyr's pivot_wider to spread the modes into columns
# after the pivot, change NA to 0 so the rows can be summed into a new variable
counties_without_totals <- county_election_results %>%
filter(mode != "TOTAL") %>%
pivot_wider(names_from = mode, values_from = candidatevotes) %>%
replace(is.na(.), 0)
# discovered that UT has both TOTAL and mode fields. Need to remove from data set. Utah will be treated as "Total" county reporter
counties_without_totals <- counties_without_totals %>%
filter(state_po != "UT")
counties_without_totals
We use the tidyr function pivot_wider to reshape the table so that each “mode” becomes its own column. Then we can sum the votes and compute the vote percentages. We will also have to group the percentages into a single county total, similar to the way the “TOTAL” mode is structured.
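The reshaping idea can be sketched on a small made-up table. The county keys and vote counts below are hypothetical. The sketch uses the values_fill argument of pivot_wider to write 0 for a mode a county did not report, instead of replacing NA in a separate step.

```r
library(dplyr)
library(tidyr)

# Toy long-format rows (made up): one row per county, party, and mode
toy_long <- tibble::tibble(
  countyID       = c("georgia_fulton", "georgia_fulton", "georgia_bibb"),
  party          = "REPUBLICAN",
  mode           = c("ABSENTEE", "ELECTION DAY", "ABSENTEE"),
  candidatevotes = c(10, 30, 5)
)

# One column per mode; values_fill = 0 replaces the NA left by missing modes
toy_wide <- toy_long %>%
  pivot_wider(names_from = mode, values_from = candidatevotes, values_fill = 0) %>%
  mutate(total_county_vote = ABSENTEE + `ELECTION DAY`)

toy_wide
```

Note that pivot_wider only merges rows whose remaining columns are identical. A column that differs by mode, such as a per-mode percentage, keeps the rows separate, so the per-county sum still has to be taken afterwards.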
# the table was already pivoted wide on mode above
# sum the mode columns (13 to 27) into a county vote total and add a new percentage variable
counties_without_totals_reshaped <- counties_without_totals %>%
mutate(total_county_vote = rowSums(.[13:27])) %>%
mutate(pct_county_vote = as.numeric(round(((total_county_vote/totalvotes)*100), digits = 3 )))
# Use the grouping function to sum up the votes per party and county
counties_without_totals_reshaped_group <- counties_without_totals_reshaped %>%
group_by(party, countyID) %>%
summarise(total_county_vote = sum(total_county_vote), .groups = "drop")
# narrow to the Republican rows
# each row holds one mode, so sum the per-mode percentages into one county percentage
county_republican_voting_reshaped <- counties_without_totals_reshaped %>%
select(2, 6, 9:10, 13, 28:29) %>%
filter(party == "REPUBLICAN") %>%
group_by(countyID, party) %>%
summarise(sum_cty_pct = sum(pct_county_vote))
# create a grouped view of the Republican county data
county_republican_voting_reshaped_group <- county_republican_voting_reshaped
# join with the county map table
county_usa_republican_tbl_reshaped <- county_usa_tbl %>%
left_join(county_republican_voting_reshaped_group , by = "countyID") %>%
filter(party == "REPUBLICAN")
county_usa_republican_tbl_reshaped
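As a cross-check on the reshaped percentages, the same county figure can be computed without pivoting: sum the per-mode votes and divide by the county total once. This is a sketch on made-up rows, and it assumes that totalvotes holds the same county-wide total on every row of a county.

```r
library(dplyr)

# Toy rows (made up): two modes for one county, one mode for another
toy_modes <- tibble::tibble(
  countyID       = c("iowa_polk", "iowa_polk", "iowa_story"),
  party          = "REPUBLICAN",
  mode           = c("ABSENTEE", "ELECTION DAY", "ABSENTEE"),
  candidatevotes = c(20, 40, 10),
  totalvotes     = c(100, 100, 50)
)

# Sum the per-mode votes, then divide by the county total once
toy_county_pct <- toy_modes %>%
  group_by(countyID, party) %>%
  summarise(
    total_county_vote = sum(candidatevotes),
    totalvotes        = first(totalvotes), # assumed constant within a county
    .groups = "drop"
  ) %>%
  mutate(sum_cty_pct = round(100 * total_county_vote / totalvotes, 3))

toy_county_pct
```

If the two approaches disagree for a county on the real data, that county is worth inspecting by hand.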
We now have the table for a choropleth of the counties in the states that tally votes by mode. We can use ggplot2 to see whether the data looks better.
# draw the county choropleth for the states that report by mode
county_usa_republican_tbl_reshaped %>%
ggplot(aes(long, lat, group = group)) +
geom_map(
aes(map_id = region),
map = county_usa_tbl,
color = "gray80", fill = "gray30", size = 0.3
) +
coord_map("ortho", orientation = c(38, -98, 0)) +
geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 50) +
theme_minimal() +
labs(title = "County Republican Voting Percentages in 2020",
x = "", y = "", fill = ""
) +
theme(
plot.title = element_text(size = 15, face = "bold", color = "red"),
legend.position = "bottom"
)
Start with a review of the Republican county results. I’ve included a snapshot of all 159 counties in Georgia.
georgia_republican_county_results <- county_usa_republican_tbl_reshaped %>%
filter(region == "georgia")
georgia_republican_county_results
Note: Counting is grouped by mode. Again, there is no “TOTAL” mode for Georgia.
# draw the Georgia county choropleth
georgia_republican_county_results %>%
ggplot(aes(long, lat, group = countyID)) +
geom_map(
aes(map_id = region), # geom_map matches map_id against the map table's region column
map = county_usa_tbl,
color = "gray80", fill = "gray30", size = 0.3
) +
coord_map("ortho", orientation = c(38, -98, 0)) +
geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 50) +
theme_minimal() +
labs(title = "Georgia Republican Voting Percentages in 2020",
x = "", y = "", fill = ""
) +
theme(
plot.title = element_text(size = 15, face = "bold", color = "red"),
legend.position = "bottom"
)
Note: the charts below do not look at the totals. They look at the breakdown by mode.
Voting mode covers the Absentee, Advanced Voting, Election Day, and Provisional categories. The pct_votes variable is the candidate's votes within one mode as a percentage of totalvotes, as computed when the county file was read.
georgia <- county_election_results %>%
filter(state_po == "GA") %>%
filter(party == "DEMOCRAT" | party == "REPUBLICAN") %>%
select(5, 9:11, 13:14)
georgia
ga_absentee <- georgia %>%
filter(mode == "ABSENTEE")%>%
arrange(subregion)
# Grouped
ggplot(ga_absentee, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Absentee Voting",
subtitle="Georgia Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(x = "Counties (All 159)", y = "Percent of Vote")
# ga_absentee
ga_advanced <- georgia %>%
filter(mode == "ADVANCED VOTING")%>%
arrange(subregion)
# Grouped
ggplot(ga_advanced, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Advanced Voting",
subtitle="Georgia Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
labs(x = "Counties (All 159)", y = "Percent of Vote")
# ga_advanced
ga_election_day <- georgia %>%
filter(mode == "ELECTION DAY")%>%
arrange(subregion )
# Grouped
ggplot(ga_election_day, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Election Day Voting",
subtitle="Georgia Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
labs(x = "Counties (All 159)", y = "Percent of Vote")
# ga_election_day
ga_provisional <- georgia %>%
filter(mode == "PROV")%>%
arrange(desc(candidatevotes))
# Grouped
ggplot(ga_provisional, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Provisional Voting",
subtitle="Georgia Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
labs(x = "Counties (All 159)", y = "Percent of Vote")
# ga_provisional
Here again the choropleth shows a bluish color, indicating that the data may not be correct.
As it turns out, we have another state, Iowa, that reported by mode. In this case there were only two modes, election day and absentee.
iowa <- county_election_results %>%
filter(state_po == "IA") %>%
filter(party == "DEMOCRAT" | party == "REPUBLICAN") %>%
select(5, 9:11, 13:14)
iowa
iowa_republican_county_results <- county_usa_republican_tbl_reshaped %>%
filter(region == "iowa")
# iowa_republican_county_results
# draw the Iowa county choropleth
iowa_republican_county_results %>%
ggplot(aes(long, lat, group = countyID)) +
geom_map(
aes(map_id = region), # geom_map matches map_id against the map table's region column
map = county_usa_tbl,
color = "gray80", fill = "gray30", size = 0.3
) +
coord_map("ortho", orientation = c(38, -98, 0)) +
geom_polygon(aes(group = group, fill = sum_cty_pct), color = "black") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 50) +
theme_minimal() +
labs(title = "Iowa Republican Voting Percentages in 2020",
x = "", y = "", fill = ""
) +
theme(
plot.title = element_text(size = 15, face = "bold", color = "red"),
legend.position = "bottom"
)
ia_absentee <- iowa %>%
filter(mode == "ABSENTEE")%>%
arrange(subregion)
# Grouped
ggplot(ia_absentee, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Absentee Voting",
subtitle="Iowa Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(x = "Counties (All 99)", y = "Percent of Vote")
# ia_absentee
ia_election_day <- iowa %>%
filter(mode == "ELECTION DAY")%>%
arrange(subregion )
# Grouped
ggplot(ia_election_day, aes(fill=party, y=pct_votes, x=subregion)) +
geom_bar(position="dodge", stat="identity")+
scale_fill_manual(values=c("blue",
"red")) +
labs(title="Election Day Voting",
subtitle="Iowa Presidential Data",
caption="Source: County Presidential Election Data") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
labs(x = "Counties (All 99)", y = "Percent of Vote")
# ia_election_day
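The per-mode bar charts for Georgia and Iowa repeat the same layers, so they could be produced by one small function. The helper below, plot_mode_bars, is hypothetical and not part of the original analysis. It is a sketch that assumes a table with the subregion, party, mode, and pct_votes columns used above, and the toy rows are made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one dodged bar chart for a single voting mode
plot_mode_bars <- function(data, vote_mode, title, subtitle, x_label) {
  data %>%
    filter(mode == vote_mode) %>%
    ggplot(aes(x = subregion, y = pct_votes, fill = party)) +
    geom_col(position = "dodge") +
    # named values tie each colour to a party regardless of factor order
    scale_fill_manual(values = c(DEMOCRAT = "blue", REPUBLICAN = "red")) +
    labs(title = title, subtitle = subtitle,
         caption = "Source: County Presidential Election Data",
         x = x_label, y = "Percent of Vote") +
    theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
}

# Toy rows (made up) with the columns the helper expects
toy_state <- tibble::tibble(
  subregion = c("adair", "adair", "adams", "adams"),
  party     = c("DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "REPUBLICAN"),
  mode      = "ABSENTEE",
  pct_votes = c(20, 15, 18, 22)
)

p <- plot_mode_bars(toy_state, "ABSENTEE", "Absentee Voting",
                    "Toy data", "Counties")
```

With the real tables, a call such as plot_mode_bars(iowa, "ELECTION DAY", "Election Day Voting", "Iowa Presidential Data", "Counties (All 99)") would stand in for one of the repeated blocks.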